DNA sequencing

ABSTRACT

Provided herein is technology relating to sequencing nucleic acids and particularly, but not exclusively, to methods, compositions, and systems for sequencing a nucleic acid using two or more labels and signal ratios to distinguish bases.

CROSS-REFERENCE TO RELATED APPLICATIONS

The present application is a divisional of U.S. application Ser. No. 14/398,304 filed Oct. 31, 2014, which is a national phase application under 35 U.S.C. § 371 of PCT International Application No. PCT/US2013/039298 filed on May 2, 2013, which claims priority to U.S. Provisional Application Ser. No. 61/641,720 filed May 2, 2012, the entirety of which is incorporated by reference herein.

FIELD OF INVENTION

Provided herein is technology relating to sequencing nucleic acids and particularly, but not exclusively, to methods, compositions, and systems for sequencing a nucleic acid using at least two labels wherein the ratio of the labels identifies and differentiates bases.

BACKGROUND

DNA sequencing is driving genomics research and discovery. The completion of the Human Genome Project was a monumental achievement involving an incredible amount of combined effort among genome centers and scientists worldwide. This decade-long project was completed using the Sanger sequencing method, which remains the staple genome sequencing methodology in high-throughput genome sequencing centers. The main reason behind the prolonged success of this method was its basic and efficient, yet elegant, method of dideoxy chain termination. With incremental improvements in Sanger sequencing (including the use of laser-induced fluorescent excitation of energy transfer dyes, engineered DNA polymerases, capillary electrophoresis, sample preparation, informatics, and sequence analysis software), the Sanger sequencing platform has been able to maintain its status. Current state-of-the-art Sanger-based DNA sequencers can produce over 700 bases of clearly readable sequence in a single run from templates up to 30 kb in length. However, as with most technological inventions, the continual improvements in this sequencing platform have reached a plateau, with the current cost estimate for producing a high-quality microbial genome draft sequence at around $10,000 per megabase pair. Current DNA sequencers based on the Sanger method allow up to 384 samples to be analyzed in parallel.

It is evident that exploiting the complete human genome sequence for clinical medicine and health care requires accurate, low-cost, and high-throughput DNA sequencing methods. Indeed, both the public (National Human Genome Research Institute, NHGRI) and private genomic sciences sectors (The J. Craig Venter Science Foundation and Archon X PRIZE for Genomics) have issued a call for the development of next-generation sequencing technology that will reduce the cost of sequencing to one ten-thousandth of its current cost over the next ten years. Accordingly, to overcome the limitations of current conventional sequencing technologies, a variety of new DNA sequencing methods have been investigated, including sequencing-by-synthesis (SBS) approaches such as pyrosequencing (Ronaghi et al. (1998) Science 281: 363-365), sequencing of single DNA molecules (Braslavsky et al. (2003) Proc. Natl. Acad. Sci. USA 100: 3960-3964), and polymerase colonies (“polony” sequencing) (Mitra et al. (2003) Anal. Biochem. 320: 55-65).

The concept of DNA sequencing-by-synthesis (SBS) was revealed in 1988 with an attempt to sequence DNA by detecting the pyrophosphate group that is generated when a nucleotide is incorporated in a DNA polymerase reaction (Hyman (1988) Anal. Biochem. 174: 423-436). Subsequent SBS technologies were based on additional ways to detect the incorporation of a nucleotide into a growing DNA strand. In general, conventional SBS uses an oligonucleotide primer designed to anneal to a predetermined position of the sample template molecule to be sequenced. The primer-template complex is presented with a nucleotide in the presence of a polymerase enzyme. If the nucleotide is complementary to the position on the sample template molecule that is directly 3′ of the end of the oligonucleotide primer, then the DNA polymerase will extend the primer with the nucleotide. The incorporation of the nucleotide and the identity of the inserted nucleotide can then be detected by, e.g., the emission of light, a change in fluorescence, a change in pH (see, e.g., U.S. Pat. No. 7,932,034), a change in enzyme conformation, or some other physical or chemical change in the reaction (see, e.g., WO 1993/023564 and WO 1989/009283; Seo et al. (2005) “Four-color DNA sequencing by synthesis on a chip using photocleavable fluorescent nucleotides,” PNAS 102: 5926-5931). Upon each successful incorporation of a nucleotide, a signal is detected that reflects the occurrence, identity, and number of nucleotide incorporations. Unincorporated nucleotides can then be removed (e.g., by chemical degradation or by washing) and the next position in the primer-template can be queried with another nucleotide species.

SUMMARY

In conventional DNA sequencing-by-synthesis using labeled nucleotide monomers, four different moieties (e.g., a dye or a fluorescent label) are attached to the four nucleotide bases to allow the detector to distinguish the bases from each other by color. For example, some methods label each of the A, C, G, and T with a fluorescent moiety that emits light at a wavelength that is distinguishable from the light emitted by the other three fluorescent moieties, e.g., to produce light of four different colors associated with each of the four bases.

In contrast, the present technology relies on differences in labeling ratios rather than on only differences in color to identify bases incorporated during a sequencing reaction. In this scheme, each individual nucleotide base is labeled at a specific known ratio of at least two different moieties (e.g., a dye, a fluorescent label, etc.). As an exemplary embodiment, ATP is labeled with two moieties X and Y in a ratio of 1:0 (all ATP molecules are labeled with moiety X), TTP is labeled with the two moieties X and Y in a ratio of 2:1 (two-thirds of the population of TTP molecules is labeled with moiety X and one-third of the population of TTP molecules is labeled with moiety Y), GTP is labeled with the two moieties X and Y in a ratio of 1:2 (one-third of the population of GTP molecules is labeled with moiety X and two-thirds of the population of GTP molecules is labeled with moiety Y), and CTP is labeled with the two moieties X and Y in a ratio of 0:1 (all CTP molecules are labeled with moiety Y). Then, according to some embodiments, a polony (e.g., a clonal colony) based sequencing approach is performed with the sequence determined by detecting the ratio of signals produced by the two dyes after each base incorporation.
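The exemplary labeling scheme above can be illustrated with a minimal sketch. It is not part of the disclosed methods: the function name `call_base`, the use of the X fraction X/(X+Y) in place of the raw X:Y ratio (chosen here only to avoid dividing by zero for the 1:0 case), and the nearest-value assignment rule are all assumptions made for illustration, and the two dyes are assumed to be equally bright.

```python
# Expected fraction of the total signal contributed by moiety X for each base,
# taken from the exemplary X:Y labeling ratios (A 1:0, T 2:1, G 1:2, C 0:1).
EXPECTED_X_FRACTION = {"A": 1.0, "T": 2 / 3, "G": 1 / 3, "C": 0.0}


def call_base(x_intensity, y_intensity):
    """Return the base whose expected X fraction is closest to the observed one.

    Hypothetical helper: assumes both intensities are background-corrected
    and that X and Y emit equally per labeled molecule.
    """
    total = x_intensity + y_intensity
    if total <= 0:
        raise ValueError("no signal detected for this incorporation")
    observed = x_intensity / total
    return min(
        EXPECTED_X_FRACTION,
        key=lambda base: abs(EXPECTED_X_FRACTION[base] - observed),
    )


# Example: an incorporation event with 65 units of X signal and 35 units of Y.
base = call_base(65, 35)
```

In this sketch a measurement of 65:35 is assigned to T because 0.65 lies closest to the expected two-thirds X fraction.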

In some embodiments, an element of the technology that allows separating and assigning the signal intensities into appropriate base-specific “bins” is the use of a 4-base calibration sequence at the beginning of a sequencing run. This calibration sequence contains each of the 4 bases in a known order to provide a calibration reference, e.g., to calibrate a sequencing instrument to recognize the appropriate signal ratios (and thus, label ratios) for each of the bases.
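One way to picture the calibration step is sketched below, under stated assumptions. The names `calibrate` and `call_with_reference`, the representation of each cycle as an `(x, y)` intensity pair, and the nearest-reference binning rule are hypothetical choices for illustration and are not specified in the text.

```python
def calibrate(calibration_sequence, signals):
    """Build a per-base reference from a known calibration sequence.

    calibration_sequence: string of known bases, e.g. "ATGC" (assumed order).
    signals: one (x_intensity, y_intensity) pair per calibration cycle.
    Returns {base: observed X fraction}, i.e. the center of each base "bin".
    """
    if len(calibration_sequence) != len(signals):
        raise ValueError("one signal pair is needed per calibration base")
    reference = {}
    for base, (x, y) in zip(calibration_sequence, signals):
        reference[base] = x / (x + y)
    return reference


def call_with_reference(reference, x, y):
    """Assign a measurement to the base with the nearest calibrated fraction."""
    observed = x / (x + y)
    return min(reference, key=lambda base: abs(reference[base] - observed))


# Example with made-up intensities for a 4-base calibration sequence.
reference = calibrate("ATGC", [(95, 5), (60, 40), (35, 65), (8, 92)])
called = call_with_reference(reference, 58, 42)
```

Using measured rather than nominal fractions lets the bins absorb instrument- and dye-specific offsets, which is the purpose the calibration sequence serves in the description above.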

As a consequence, embodiments of the technology reduce the number of fluorescent dyes needed to identify the four bases (e.g., allowing one to use only the optimal dyes to acquire a sequence), reduce the number of lasers used to excite labels (e.g., fluorescent moieties), reduce or eliminate optics used to split the optical signal by wavelength, and reduce the number of detectors for recording incorporation events.

Accordingly, provided herein are methods for sequencing a target nucleic acid, the method comprising: determining a signal ratio produced from a plurality of a nucleotide base in which a first fraction of the plurality is labeled with a first label and a second fraction of the plurality is labeled with a second label; and associating the signal ratio with the nucleotide base to identify the nucleotide base. In some embodiments, the signal ratio produced by the plurality of the nucleotide base is detectably different than a second signal ratio produced by a second plurality of a second nucleotide base. For example, some embodiments provide that the first fraction of the plurality produces a first signal having a first intensity and the second fraction of the plurality produces a second signal having a second intensity and the signal ratio is a ratio of the first intensity relative to the second intensity. In some embodiments, the first intensity is a first fluorescence emission amplitude and the second intensity is a second fluorescence emission amplitude; some embodiments provide that the first intensity is a first peak height and the second intensity is a second peak height.

In some aspects, the technology relates to sequencing nucleic acids using labeled nucleotides. Accordingly, in some embodiments, the methods comprise providing a first plurality of a first nucleotide base and a second plurality of a second nucleotide base, wherein a first ratio of a first portion of the first plurality labeled with the first label relative to a second portion of the first plurality labeled with the second label is detectably different than a second ratio of a third portion of the second plurality labeled with the first label relative to a fourth portion of the second plurality labeled with the second label. Moreover, some embodiments further comprise providing a first plurality of a first nucleotide base, a second plurality of a second nucleotide base, a third plurality of a third nucleotide base, and a fourth plurality of a fourth nucleotide base, wherein a first ratio of a first portion of the first plurality labeled with the first label relative to a second portion of the first plurality labeled with the second label, a second ratio of a third portion of the second plurality labeled with the first label relative to a fourth portion of the second plurality labeled with the second label, a third ratio of a fifth portion of the third plurality labeled with the first label relative to a sixth portion of the third plurality labeled with the second label, and a fourth ratio of a seventh portion of the fourth plurality labeled with the first label relative to an eighth portion of the fourth plurality labeled with the second label are all detectably different from each other. For example, in some embodiments the first nucleotide base is A, the second nucleotide base is C, the third nucleotide base is G, and the fourth nucleotide base is T.

The technology relates to using different ratios of two labels to differentiate nucleotides in a sequencing reaction. Thus, some embodiments provide that the first label is a first fluorescent moiety with an emission peak at a first wavelength and the second label is a second fluorescent moiety with an emission peak at a second wavelength. Moreover, some embodiments comprise monitoring a first channel and a second channel, wherein the first label produces a signal in the first channel and the second label produces a signal in the second channel.

In some aspects, the methods are related to sequencing-by-synthesis; thus, methods are provided that comprise incorporating by polymerization the plurality of the nucleotide base into a plurality of a nucleic acid that is complementary to the target nucleic acid. Various detection methods are contemplated by the present technology. For instance, in some embodiments, the methods comprise monitoring signals with an optical device.

In some embodiments, the methods comprise providing a calibration oligonucleotide comprising a known sequence. In addition, some embodiments comprise analyzing a dataset of ordered signal ratios to produce a nucleotide sequence of the target nucleic acid.
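As a rough illustration of turning a dataset of ordered signal ratios into a sequence, the sketch below calls one base per cycle and then reverse-complements the result, since the incorporated bases form a strand complementary to the target. The expected fractions reuse the exemplary X:Y labeling ratios from this summary; the function names, the use of X fractions as input, and the nearest-value rule are assumptions for illustration only.

```python
# Hypothetical expected X-signal fractions, from the exemplary X:Y labeling
# ratios in this summary (A 1:0, T 2:1, G 1:2, C 0:1).
EXPECTED = {"A": 1.0, "T": 2 / 3, "G": 1 / 3, "C": 0.0}
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def incorporated_sequence(ordered_fractions):
    """Call one incorporated base per cycle from ordered X-signal fractions."""
    return "".join(
        min(EXPECTED, key=lambda base: abs(EXPECTED[base] - fraction))
        for fraction in ordered_fractions
    )


def target_sequence(ordered_fractions):
    """Reverse-complement the synthesized strand to read the target 5' to 3'."""
    synthesized = incorporated_sequence(ordered_fractions)
    return "".join(COMPLEMENT[base] for base in reversed(synthesized))


# Example: five sequencing cycles with made-up X fractions.
fractions = [0.97, 0.70, 0.31, 0.02, 0.64]
synthesized = incorporated_sequence(fractions)
target = target_sequence(fractions)
```

Here the synthesized strand is called as ATGCT, so the corresponding stretch of the target reads AGCAT.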

In addition, some aspects of the technology relate to compositions comprising a plurality of a nucleotide base wherein a first portion of the plurality is labeled with a first label and a second portion of the plurality is labeled with a second label. In some embodiments, the compositions further comprise a second plurality of a second nucleotide base wherein a third portion of the second plurality is labeled with the first label and a fourth portion of the second plurality is labeled with the second label and a first ratio of the first portion relative to the second portion is different than a second ratio of the third portion relative to the fourth portion. In some embodiments, the compositions further comprise a third plurality of a third nucleotide base and a fourth plurality of a fourth nucleotide base, wherein a fifth portion of the third plurality is labeled with the first label and a sixth portion of the third plurality is labeled with the second label, and a seventh portion of the fourth plurality is labeled with the first label and an eighth portion of the fourth plurality is labeled with the second label and a third ratio of the fifth portion relative to the sixth portion is different than a fourth ratio of the seventh portion relative to the eighth portion and the third and the fourth ratios are both different than the first and the second ratios. For example, in some embodiments the first nucleotide base is A, the second nucleotide base is C, the third nucleotide base is G, and the fourth nucleotide base is T and, in some embodiments, the label is a fluorescent moiety. In some embodiments, the first, the second, the third, and/or the fourth base is a modified base or a base analogue such as an inosine, isoguanine, isocytosine, a diaminopyrimidine, a xanthine, a nitroazole, a size-expanded base, etc.

While the technology relates in some aspects to nucleotide compositions, the technology also relates to compositions further comprising a target nucleic acid, a sequencing primer, and a polymerase. Some embodiments of compositions comprise a nucleic acid comprising the nucleotide base.

The methods and compositions are related, for example, to embodiments of systems for sequencing a nucleic acid, wherein the systems comprise: a composition comprising a plurality of a nucleotide base wherein a first portion of the plurality is labeled with a first label and a second portion of the plurality is labeled with a second label; and a calibration oligonucleotide. Some system embodiments further comprise a sequencing apparatus, some system embodiments further comprise a processor configured to associate a signal ratio with a nucleotide base, and some system embodiments further comprise an output functionality to provide a nucleotide sequence of the nucleic acid.

Systems for sequencing a target (template) nucleic acid comprise in some embodiments a second plurality of a second nucleotide base, a third plurality of a third nucleotide base, and a fourth plurality of a fourth nucleotide base, wherein a third portion of the second plurality is labeled with the first label and a fourth portion of the second plurality is labeled with the second label, a fifth portion of the third plurality is labeled with the first label and a sixth portion of the third plurality is labeled with the second label, and a seventh portion of the fourth plurality is labeled with the first label and an eighth portion of the fourth plurality is labeled with the second label and a first ratio of the first portion relative to the second portion, a second ratio of the third portion to the fourth portion, a third ratio of the fifth portion to the sixth portion, and a fourth ratio of the seventh portion to the eighth portion are different from one another. Thus, some embodiments also comprise a functionality to detect the first label and the second label and, furthermore, some embodiments comprise a functionality to differentiate the nucleotide base, the second nucleotide base, the third nucleotide base, and the fourth nucleotide base from one another.

The technology finds use in kits comprising embodiments of the compositions provided for, e.g., practicing embodiments of the methods provided. For example, some embodiments provide a kit for sequencing a nucleic acid, wherein the kit comprises a composition comprising a plurality of a nucleotide base wherein a first portion of the plurality is labeled with a first label and a second portion of the plurality is labeled with a second label; and a calibration oligonucleotide.

In some embodiments, kits are provided that comprise a second plurality of a second nucleotide base, a third plurality of a third nucleotide base, and a fourth plurality of a fourth nucleotide base, wherein a third portion of the second plurality is labeled with the first label and a fourth portion of the second plurality is labeled with the second label, a fifth portion of the third plurality is labeled with the first label and a sixth portion of the third plurality is labeled with the second label, and a seventh portion of the fourth plurality is labeled with the first label and an eighth portion of the fourth plurality is labeled with the second label and a first ratio of the first portion relative to the second portion, a second ratio of the third portion to the fourth portion, a third ratio of the fifth portion to the sixth portion, and a fourth ratio of the seventh portion to the eighth portion are different from one another.

Additional embodiments will be apparent to persons skilled in the relevant art based on the teachings contained herein.

DETAILED DESCRIPTION

Provided herein is technology relating to sequencing nucleic acids and particularly, but not exclusively, to methods, compositions, systems, and kits for sequencing a nucleic acid using the ratio between multiple label signals to differentiate bases.

The section headings used herein are for organizational purposes only and are not to be construed as limiting the described subject matter in any way.

In this detailed description of the various embodiments, for purposes of explanation, numerous specific details are set forth to provide a thorough understanding of the embodiments disclosed. One skilled in the art will appreciate, however, that these various embodiments may be practiced with or without these specific details. In other instances, structures and devices are shown in block diagram form. Furthermore, one skilled in the art can readily appreciate that the specific sequences in which methods are presented and performed are illustrative and it is contemplated that the sequences can be varied and still remain within the spirit and scope of the various embodiments disclosed herein.

All literature and similar materials cited in this application, including but not limited to, patents, patent applications, articles, books, treatises, and internet web pages are expressly incorporated by reference in their entirety for any purpose. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as is commonly understood by one of ordinary skill in the art to which the various embodiments described herein belong. When definitions of terms in incorporated references appear to differ from the definitions provided in the present teachings, the definition provided in the present teachings shall control.

It will be appreciated that there is an implied “about” prior to the temperatures, concentrations, times, etc. discussed in the present teachings, such that insubstantial deviations are within the scope of the present teachings. In this application, the use of the singular includes the plural unless specifically stated otherwise. Also, the use of “comprise”, “comprises”, “comprising”, “contain”, “contains”, “containing”, “include”, “includes”, and “including” is not intended to be limiting. It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the present teachings.

Further, unless otherwise required by context, singular terms shall include pluralities and plural terms shall include the singular. Generally, nomenclatures utilized in connection with, and techniques of, cell and tissue culture, molecular biology, and protein and oligonucleotide or polynucleotide chemistry and hybridization described herein are those well known and commonly used in the art. Unless otherwise indicated, standard techniques are used, for example, for nucleic acid purification and preparation, chemical analysis, recombinant nucleic acid, and oligonucleotide synthesis. Enzymatic reactions and purification techniques are performed according to manufacturer's specifications or as commonly accomplished in the art or as described herein. The techniques and procedures described herein are generally performed according to conventional methods as described in various general and more specific references that are cited and discussed throughout the instant specification. See, e.g., Sambrook et al., Molecular Cloning: A Laboratory Manual (Third ed., Cold Spring Harbor Laboratory Press, Cold Spring Harbor, N.Y. (2000)). The nomenclatures utilized in connection with, and the laboratory procedures and techniques described herein are those well known and commonly used in the art.

Definitions

To facilitate an understanding of the present technology, a number of terms and phrases are defined below. Additional definitions are set forth throughout the detailed description.

Throughout the specification and claims, the following terms take the meanings explicitly associated herein, unless the context clearly dictates otherwise. The phrase “in one embodiment” as used herein does not necessarily refer to the same embodiment, though it may. Furthermore, the phrase “in another embodiment” as used herein does not necessarily refer to a different embodiment, although it may. Thus, as described below, various embodiments of the invention may be readily combined, without departing from the scope or spirit of the invention.

In addition, as used herein, the term “or” is an inclusive “or” operator and is equivalent to the term “and/or” unless the context clearly dictates otherwise. The term “based on” is not exclusive and allows for being based on additional factors not described, unless the context clearly dictates otherwise. In addition, throughout the specification, the meaning of “a”, “an”, and “the” include plural references. The meaning of “in” includes “in” and “on.”

A “system” denotes a set of components, real or abstract, comprising a whole where each component interacts with or is related to at least one other component within the whole.

As used herein, the phrase “dNTP” means deoxynucleotide triphosphate, where the nucleotide comprises a nucleotide base, such as A, T, C, G or U.

The term “monomer” as used herein means any compound that can be incorporated into a growing molecular chain by a given polymerase. Such monomers include, without limitation, naturally occurring nucleotides (e.g., ATP, GTP, TTP, UTP, CTP, dATP, dGTP, dTTP, dUTP, dCTP, synthetic analogs), precursors for each nucleotide, non-naturally occurring nucleotides and their precursors, or any other molecule that can be incorporated into a growing polymer chain by a given polymerase.

As used herein, a “nucleic acid” shall mean any nucleic acid molecule, including, without limitation, DNA, RNA and hybrids thereof. The nucleic acid bases that form nucleic acid molecules can be the bases A, C, G, T and U, as well as derivatives thereof. Derivatives of these bases are well known in the art. The term should be understood to include, as equivalents, analogs of either DNA or RNA made from nucleotide analogs. The term as used herein also encompasses cDNA, that is, complementary (or copy) DNA produced from an RNA template, for example by the action of reverse transcriptase. It is well known that DNA (deoxyribonucleic acid) is a chain of nucleotides consisting of 4 types of nucleotides, namely A (adenine), T (thymine), C (cytosine), and G (guanine), and that RNA (ribonucleic acid) is a chain of nucleotides consisting of 4 types of nucleotides, namely A, U (uracil), G, and C. It is also known that these 5 types of nucleotides specifically bind to one another in combinations called complementary base pairing. That is, adenine (A) pairs with thymine (T) (in the case of RNA, however, adenine (A) pairs with uracil (U)), and cytosine (C) pairs with guanine (G), so that each of these base pairs forms a double strand. As used herein, “nucleic acid sequencing data”, “nucleic acid sequencing information”, “nucleic acid sequence”, “genomic sequence”, “genetic sequence”, “fragment sequence”, or “nucleic acid sequencing read” denotes any information or data that is indicative of the order of the nucleotide bases (e.g., adenine, guanine, cytosine, and thymine/uracil) in a molecule (e.g., a whole genome, a whole transcriptome, an exome, oligonucleotide, polynucleotide, fragment, etc.) of DNA or RNA.

Reference to a base, a nucleotide, or to another molecule may be in the singular or plural. That is, a base may refer to a single molecule of that base or to a plurality of that base, e.g., in a solution.

As used herein, the phrase “a clonal plurality of nucleic acids” or “a clonal population of nucleic acids” or “a cluster” or “a polony” refers to a set of nucleic acid products that are substantially or completely or essentially identical to each other, and they are complementary copies of the template nucleic acid strand from which they are synthesized.

As used herein, a “polynucleotide”, also called a nucleic acid, is a covalently linked series of nucleotides in which the 3′ position of the pentose of one nucleotide is joined by a phosphodiester group to the 5′ position of the next. DNA (deoxyribonucleic acid) and RNA (ribonucleic acid) are biologically occurring polynucleotides in which the nucleotide residues are linked in a specific sequence by phosphodiester linkages. As used herein, the terms “polynucleotide” or “oligonucleotide” encompass any polymer compound having a linear backbone of nucleotides. An “oligodeoxyribonucleotide” or “oligonucleotide”, also termed an “oligomer”, is generally a polynucleotide of a shorter length.

As used herein, “complementary” generally refers to specific nucleotide duplexing to form canonical Watson-Crick base pairs, as is understood by those skilled in the art. However, complementary also includes base-pairing of nucleotide analogs that are capable of universal base-pairing with A, T, G or C nucleotides and locked nucleic acids that enhance the thermal stability of duplexes. One skilled in the art will recognize that hybridization stringency is a determinant in the degree of match or mismatch in the duplex formed by hybridization.

As used herein, “moiety” refers to one of two or more parts into which something may be divided, such as, for example, the various parts of a tether, a molecule or a probe.

A “polymerase” is an enzyme generally for joining 3′-OH 5′-triphosphate nucleotides, oligomers, and their analogs. Polymerases include, but are not limited to, DNA-dependent DNA polymerases, DNA-dependent RNA polymerases, RNA-dependent DNA polymerases, RNA-dependent RNA polymerases, T7 DNA polymerase, T3 DNA polymerase, T4 DNA polymerase, T7 RNA polymerase, T3 RNA polymerase, SP6 RNA polymerase, DNA polymerase I, Klenow fragment, Thermus aquaticus DNA polymerase, Tth DNA polymerase, Vent DNA polymerase (New England Biolabs), Deep Vent DNA polymerase (New England Biolabs), Bst DNA Polymerase Large Fragment, Stoffel Fragment, 9° N DNA Polymerase, Pfu DNA Polymerase, Tfl DNA Polymerase, RepliPHI Phi29 Polymerase, Tli DNA polymerase, eukaryotic DNA polymerase beta, telomerase, Therminator polymerase (New England Biolabs), KOD HiFi DNA polymerase (Novagen), KOD1 DNA polymerase, Q-beta replicase, terminal transferase, AMV reverse transcriptase, M-MLV reverse transcriptase, Phi6 reverse transcriptase, HIV-1 reverse transcriptase, novel polymerases discovered by bioprospecting, and polymerases cited in U.S. Pat. Appl. Pub. No. 2007/0048748 and in U.S. Pat. Nos. 6,329,178; 6,602,695; and 6,395,524. These polymerases include wild-type, mutant isoforms, and genetically engineered variants such as exo-polymerases and other mutants, e.g., that tolerate labeled nucleotides and incorporate them into a strand of nucleic acid.

The term “primer” refers to an oligonucleotide, whether occurring naturally as in a purified restriction digest or produced synthetically, which is capable of acting as a point of initiation of synthesis when placed under conditions in which synthesis of a primer extension product that is complementary to a nucleic acid strand is induced (e.g., in the presence of nucleotides and an inducing agent such as DNA polymerase and at a suitable temperature and pH). The primer is preferably single stranded for maximum efficiency in amplification, but may alternatively be double stranded. If double stranded, the primer is first treated to separate its strands before being used to prepare extension products. Preferably, the primer is an oligodeoxyribonucleotide. The primer must be sufficiently long to prime the synthesis of extension products in the presence of the inducing agent. The exact lengths of the primers will depend on many factors, including temperature, source of primer and the use of the method.

Embodiments of the Technology

The technology relates generally to methods, compositions, systems, and kits for DNA sequencing using a sequencing-by-synthesis approach. Although the disclosure herein refers to certain illustrated embodiments, it is to be understood that these embodiments are presented by way of example and not by way of limitation.

Methods

Some embodiments of the technology provide for methods of DNA sequencing-by-synthesis in which different ratios of signal intensities, rather than different signal wavelengths, identify bases incorporated during, for example, a sequencing-by-synthesis reaction. In some embodiments, an ensemble based (e.g., a polymerase colony (“polony”) or clonal colony) sequencing approach is used. These approaches sequence multiple identical or substantially identical copies of a DNA molecule that form a cluster of template molecules. Methods of forming clusters are provided, e.g., in U.S. Pat. No. 7,115,400. In some embodiments, the clusters are immobilized on a solid support such as a bead. These clusters typically result from amplifying a single originating DNA molecule; thus, each cluster represents the single molecule that initiated the amplification. For example, in the “bridge amplification” process used in Solexa sequencing, approximately 1 million copies of the original DNA molecule fragment are present in the cluster. Then, depending on the sequencing chemistry and methodology of particular embodiments, bases are added to the collection of clusters (or, equivalently, colonies, polonies, etc.). In an ensemble method according to the present technology, the ratio of labeling is directly associated with the qualities of the signal produced. For example, a base (e.g., a plurality of a base in solution) labeled at a ratio of 1:1 with two moieties that emit fluorescent signals at two distinct wavelengths will produce an emission spectrum having two peaks at the two wavelengths; the same amount of the base labeled at a ratio of 1:0 with the two moieties will generally emit a fluorescent signal having a single peak that is twice as intense as either of the peaks for the base when labeled in a 1:1 ratio. The ratio of the relative peak heights (e.g., the signal intensities at two wavelengths) in the spectrum is generally similar to the ratio of the proportions of the base population labeled with each moiety. For example, a population of bases labeled at a ratio of 1:3 with two moieties X and Y will generally produce a spectrum having two peaks (corresponding to the emission wavelengths of X and Y) whose relative heights are also 1:3. Peak height and signal intensity are related such that the two terms can be used interchangeably. Consequently, the ratio of the signals can be measured, for example, by the ratio of peak heights in a spectrum, other peak characteristics (such as peak area, peak width at half height, etc.) and/or as a ratio of two intensities measured using, e.g., filters or optical splitting. In some embodiments, ratios are corrected using a correction factor related to a difference in the molar emission efficiency for two moieties.
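The correction for unequal molar emission efficiency can be illustrated with a short sketch. The text does not give a formula, so the form used here (dividing each intensity by a relative per-molecule brightness before taking the ratio) is an assumption, as are the function and parameter names.

```python
def corrected_ratio(x_intensity, y_intensity, x_efficiency, y_efficiency):
    """Estimate the X:Y labeling ratio from raw intensities.

    x_efficiency and y_efficiency are hypothetical relative brightness values
    (signal per labeled molecule) for moieties X and Y. Dividing each raw
    intensity by its efficiency gives a quantity proportional to the number
    of molecules carrying that label.
    """
    if y_intensity <= 0:
        raise ValueError("Y signal must be positive to form an X:Y ratio")
    return (x_intensity / x_efficiency) / (y_intensity / y_efficiency)


# Example: a population labeled 1:3 (X:Y) where Y is assumed to be twice as
# bright as X per molecule. The raw intensities are then 1 and 6, which would
# naively suggest a 1:6 ratio; the correction recovers 1:3.
ratio = corrected_ratio(1.0, 6.0, x_efficiency=1.0, y_efficiency=2.0)
```

With equal efficiencies the function reduces to the plain intensity ratio, which matches the uncorrected case described above.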

In general, two approaches for base addition are used in ensemble-based sequencing-by-synthesis: in the first, the bases are provided one at a time; in the second, bases are modified with identifying moieties so that the base type of the incorporated nucleotide is identified as synthesis proceeds. In some embodiments, synthesis is synchronously controlled by adding one base at a time (see, e.g., Margulies, M. et al. “Genome sequencing in microfabricated high-density picolitre reactors”, Nature 437: 376-380 (2005); Harris, T. D. et al. “Single-molecule DNA sequencing of a viral genome”, Science 320: 106-109 (2008)) or by using nucleotides that are reversibly blocked. In particular embodiments, extension is momentarily blocked following each base addition by using modified nucleotides (e.g., nucleotide reversible terminators as described in, e.g., WO2004/018497; U.S. Pat. Appl. Pub. No. 2007/0166705; Bentley, D. R. et al. “Accurate whole human genome sequencing using reversible terminator chemistry”, Nature 456: 53-59 (2008); Turcatti, G. et al. “A new class of cleavable fluorescent nucleotides: synthesis and optimization as reversible terminators for DNA sequencing by synthesis”, Nucleic Acids Res. 36: e25 (2008); Guo, J. et al. “Four-color DNA sequencing with 3′-O-modified nucleotide reversible terminators and chemically cleavable fluorescent dideoxynucleotides”, Proc. Natl. Acad. Sci. USA 105: 9145-9150 (2008); Ju, J. et al. “Four-color DNA sequencing by synthesis using cleavable fluorescent nucleotide reversible terminators”, Proc. Natl. Acad. Sci. USA 103: 19635-19640 (2006); Seo, T. S. et al. “Four-color DNA sequencing by synthesis on a chip using photocleavable fluorescent nucleotides”, Proc. Natl. Acad. Sci. USA 102: 5926-5931 (2005); Wu, W. et al. “Termination of DNA synthesis by N6-alkylated, not 3′-O-alkylated, photocleavable 2′-deoxyadenosine triphosphates”, Nucleic Acids Res. 35: 6339-6349 (2007)) or by omitting reaction components such as divalent metal ions (see, e.g., WO 2005/123957; U.S. Pat. Appl. Pub. No. 20060051807).

Typically, each base addition is followed by a washing step to remove excess reactants. Then, while synthesis is stopped, clusters are imaged to determine which base was added. In embodiments when one base is added per reaction cycle, the successful incorporation of a base indicates the base (and thus the sequence) at that position. These base additions are detected typically by fluorescence (see, e.g., Harris, supra) or by enzyme cascades that identify the release of pyrophosphate by the production of light (see, e.g., Margulies, supra; Bentley, supra). According to the technology provided herein, base identity is associated with the ratio of signals (e.g., intensities or spectrum peaks) generated and detected during this detection phase, which, in turn, is associated with the ratio of base labeling.

When all bases are added simultaneously, bases are conventionally discriminated by different tags (e.g., fluorescent moieties) attached to each base (see, e.g., Korlach, J. et al. “Selective aluminum passivation for targeted immobilization of single DNA polymerase molecules in zero-mode waveguide nanostructures”, Proc. Natl. Acad. Sci. USA 105: 1176-1181 (2008); U.S. Pat. Appl. Pub. No. US 20030194740; U.S. Pat. Appl. Pub. No. US 20030064366; Turcatti, G., et al. “A new class of cleavable fluorescent nucleotides: synthesis and optimization as reversible terminators for DNA sequencing by synthesis”, Nucleic Acids Res. 36: e25 (2008); Guo, J. et al. “Four-color DNA sequencing with 3′-O-modified nucleotide reversible terminators and chemically cleavable fluorescent dideoxynucleotides”, Proc. Natl. Acad. Sci. USA 105: 9145-9150 (2008); Ju, J. et al. “Four-color DNA sequencing by synthesis using cleavable fluorescent nucleotide reversible terminators”, Proc. Natl. Acad. Sci. USA 103: 19635-19640 (2006); Seo, T. S. et al. “Four-color DNA sequencing by synthesis on a chip using photocleavable fluorescent nucleotides”, Proc. Natl. Acad. Sci. USA 102: 5926-5931 (2005); Wu, W. et al. “Termination of DNA synthesis by N6-alkylated, not 3′-O-alkylated, photocleavable 2′-deoxyadenosine triphosphates”, Nucleic Acids Res. 35: 6339-6349 (2007); WO 2006/084132). According to embodiments of the technology provided herein, base identity is associated with a ratio of signals generated (e.g., peak height and/or signal intensity at multiple distinguishable wavelengths), which, in turn, is associated with the ratio of base labeling.

For example, in some embodiments all four nucleotides are added simultaneously to the reaction comprising DNA polymerase and the clusters of template-primer complexes. In some embodiments, the nucleotides carry a fluorescent label and the 3′ hydroxyl group is chemically blocked (e.g., with a labeled reversible terminator) so that synthesis stops after a base is incorporated into the growing (synthesized) DNA strand. An imaging step follows each base incorporation step, during which the clusters are imaged. To image the clusters, in some embodiments the fluorescent labels are excited by a laser and then the fluorescence emitted from the clusters is recorded. In some embodiments, the imaging records the wavelengths and the intensities at those wavelengths of the fluorescence emitted. According to the present technology, at least two bases are labeled using different ratios of two labels, and thus differences in the ratio of signals emitted at the two wavelengths associated with the two labels differentiate the bases from one another. Then, before initiating the next synthetic cycle, the 3′ terminal blocking groups are removed to provide a substrate for the incorporation of the next base. The cycles are repeated in this fashion to determine the sequence of the templates one base at a time.
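The imaging-and-calling cycle above can be illustrated with a small simulation of a single cluster. This is a sketch under stated assumptions: the per-base fractions of label X, the cluster size of 10,000 copies, the fixed random seed, and all function names are hypothetical choices for illustration, and each labeled copy is assumed to contribute one unit of signal in its channel.

```python
import random

# Assumed fraction of each base's molecules carrying label X (the rest carry Y).
FRACTION_X = {"A": 1 / 2, "C": 1 / 3, "G": 2 / 3, "T": 1.0}


def image_cluster(base, copies, rng):
    """Simulate one imaging step: count X- and Y-labeled copies in a cluster."""
    x_count = sum(rng.random() < FRACTION_X[base] for _ in range(copies))
    return x_count, copies - x_count


def call_base(intensity_x, intensity_y):
    """Call the base whose expected X fraction is closest to the observed one."""
    observed = intensity_x / (intensity_x + intensity_y)
    return min(FRACTION_X, key=lambda base: abs(FRACTION_X[base] - observed))


def read_cluster(incorporated_bases, copies=10_000, seed=0):
    """Run one image-and-call cycle per incorporated base."""
    rng = random.Random(seed)
    calls = []
    for base in incorporated_bases:
        intensity_x, intensity_y = image_cluster(base, copies, rng)
        calls.append(call_base(intensity_x, intensity_y))
    return "".join(calls)


print(read_cluster("GATTACA"))
```

Because every copy in the cluster incorporates the same base type in a given cycle, the observed signal ratio approaches the label ratio as the number of copies grows, which is why an ensemble is convenient for this kind of discrimination.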

In some embodiments, each nucleotide is added one at a time to a reaction mixture containing the nucleic acid target-primer complex and the polymerase, the reaction is monitored for a signal, and the base is then removed from the reaction. For example, an illustrative embodiment of the method comprises:

1. providing a sequencing primer, a template, a polymerase, and solutions of the four bases A, C, G, and T
2. hybridizing the primer to the template under appropriate chemical and physical conditions
3. adding an aliquot of a solution comprising the A base to the reaction
4. monitoring the reaction for the production of a signal
5. removing the A base from the solution
6. adding an aliquot of a solution comprising the C base to the reaction
7. monitoring the reaction for the production of a signal
8. removing the C base from the solution
9. adding an aliquot of a solution comprising the G base to the reaction
10. monitoring the reaction for the production of a signal
11. removing the G base from the solution
12. adding an aliquot of a solution comprising the T base to the reaction
13. monitoring the reaction for the production of a signal
14. removing the T base from the solution
15. repeating steps 3-14 until the template is sequenced.
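The add, monitor, and remove steps listed above can be sketched as a simple loop. This is an idealized model and not the disclosed method itself: it assumes perfect incorporation, gives the template in the order in which the polymerase copies it, and lets a run of identical template bases incorporate within a single flow. The function and variable names are hypothetical.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def sequence_by_single_base_addition(template, flow_order="ACGT"):
    """Flow one base at a time and record which flows give an incorporation."""
    if set(flow_order) != set("ACGT"):
        raise ValueError("flow_order must contain each of A, C, G, and T")
    synthesized = []
    position = 0  # index of the next template base to be copied
    while position < len(template):
        for base in flow_order:  # add an aliquot of one base
            # Monitor for a signal: the base incorporates only where it is
            # complementary to the next template position.
            while (position < len(template)
                   and COMPLEMENT[template[position]] == base):
                synthesized.append(base)
                position += 1
            # Removing the base (the wash step) needs no bookkeeping here.
    return "".join(synthesized)


print(sequence_by_single_base_addition("TGCA"))          # synthesized strand
print(sequence_by_single_base_addition("AAGT", "TCGA"))  # a permuted flow order
```

The second call uses a different flow order to show that the recovered sequence does not depend on the order in which the bases are offered, only the number of flows needed does.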

During each cycle, the detection of an output signal appropriate for the base added in the previous step indicates a successful incorporation of that base and thus identifies the base incorporated at that step. Detection may be by conventional technology. For example, if the label is a fluorescent moiety, then detection of an incorporated base may be carried out by using a confocal scanning microscope to scan the collection of clusters (e.g., attached to a surface) with a laser to image the fluorescent moieties bound directly to the incorporated bases. Alternatively, a sensitive 2D detector, such as a charge-coupled device (CCD), can be used to visualize the signals generated. However, other techniques such as scanning near-field optical microscopy (SNOM) are available and may be used when imaging dense arrays. For example, using SNOM, individual polynucleotides may be distinguished when separated by a distance of less than 100 nm, e.g., 10 nm to 10 μm. For a description of scanning near-field optical microscopy, see Moyer et al., Laser Focus World (1993) 29:10. Suitable apparatuses used for imaging polynucleotide arrays are known and the technical set-up will be apparent to the skilled person. The detection is preferably used in combination with an analysis system to determine the number and nature of the nucleotides incorporated for each step of synthesis. This analysis, which may be carried out immediately after each synthesis step, or later using recorded data, allows the sequence of the nucleic acid template within a given colony to be determined.

While this exemplary embodiment indicates adding the bases in the order A, C, G, and T, the technology is not limited to this order. Indeed, in some embodiments the bases are added in any permuted order of the set {A C G T} or {A C G U}, e.g., A, G, C, T; A, T, C, G; T, C, G, A, etc. In addition, some embodiments provide that base analogues, modified bases, and other molecules are added instead of A, C, G, and T. It is to be understood that the nucleotides comprising uridine (“U”) can be used in place of T and vice-versa. If the sequence being determined is unknown, the nucleotides added are usually applied in a chosen order that is then repeated throughout the analysis, for example as discussed above. If, however, the sequence being determined is known and is being re-sequenced, for example, to determine if small differences are present in the sequence relative to the known sequence, the sequencing determination process may be made quicker by adding the nucleotides at each step in the appropriate order, chosen according to the known sequence. Differences from the given sequence are thus detected by the lack of incorporation of certain nucleotides at particular stages of primer extension.

As an improved method of detecting base addition in SBS, the technology is generally applicable to SBS methods in which bases are differentially labeled to identify them. However, while conventional technologies differentiate bases solely by color (e.g., peak position in the wavelength domain), the technology provided herein differentiates bases by differences in the ratio of base labels, e.g., the ratio of signal intensities produced by at least two labels emitting at two distinguishable wavelengths. For example, in some embodiments all four bases are labeled at a different ratio with two labels (e.g., fluorescent moieties).

With respect to sequencing-by-synthesis methods and schemes that find use, e.g., as appropriately adapted to the methods provided herein, Morozova and Marra provide a review of some such technologies in Genomics 92: 255 (2008); additional discussions are found in Mardis, Annu. Rev. Genomics Hum. Genet. (2008) 9:387-402 and in Fuller, et al. (2009) Nat. Biotechnol. 27: 1013.

More specifically, some embodiments provide for the use of bases labeled at different ratios in an ensemble sequencing-by-synthesis technique such as the following: parallel sequencing of partitioned amplicons (PCT Publication No: WO2006084132); parallel oligonucleotide extension (see, e.g., U.S. Pat. Nos. 5,750,341; 6,306,597); polony sequencing (Mitra et al. (2003) Analytical Biochemistry 320: 55-65; Shendure et al. (2005) Science 309: 1728-1732; U.S. Pat. Nos. 6,432,360, 6,485,944, 6,511,803); the Solexa single base addition technology (see, e.g., Bennett et al. (2005), Pharmacogenomics 6: 373-382; U.S. Pat. Nos. 6,787,308; 6,833,246; herein incorporated by reference in their entireties), the Lynx massively parallel signature sequencing technology (Brenner et al. (2000). Nat. Biotechnol. 18: 630-634; U.S. Pat. Nos. 5,695,934; 5,714,330), and the Adessi PCR colony technology (Adessi et al. (2000). Nucleic Acid Res. 28: E87; WO 00018957).

In an exemplary embodiment, Solexa sequencing is used. In the Solexa/Illumina platform (Voelkerding et al., Clinical Chem., 55: 641-658, 2009; MacLean et al., Nature Rev. Microbiol. 7: 287-296; U.S. Pat. Nos. 6,833,246; 7,115,400; 6,969,488; and 6,787,308, each herein incorporated by reference in its entirety), sequencing data are produced in the form of shorter-length reads. In this method, single-stranded fragmented DNA is end-repaired to generate 5′-phosphorylated blunt ends, followed by Klenow-mediated addition of a single A base to the 3′ end of the fragments. A-addition facilitates addition of T-overhang adaptor oligonucleotides, which are subsequently used to capture the template-adaptor molecules on the surface of a flow cell that is studded with oligonucleotide anchors. The anchor is used as a PCR primer, but because of the length of the template and its proximity to other nearby anchor oligonucleotides, extension by PCR results in the “arching over” of the molecule to hybridize with an adjacent anchor oligonucleotide to form a bridge structure (and, after several rounds of amplification, a cluster) on the surface of the flow cell. These loops of DNA are denatured and cleaved. Forward strands are then sequenced with reversible dye terminators. The sequence of incorporated nucleotides is determined by detection of post-incorporation fluorescence (e.g., by differences in signal ratios), with each fluor and block removed prior to the next cycle of dNTP addition. Sequence read length ranges from 36 nucleotides to over 50 nucleotides, with overall output exceeding 1 billion nucleotide pairs per analytical run.

In some embodiments, a calibration sequence is used to differentiate the ratios of signal intensities associated with the bases. For example, such a calibration sequence comprises, in some embodiments, each of the four bases in a known order so that a sequencing instrument is calibrated to recognize the signal intensities (due to the label ratios) expected for each of the bases complementary to the calibration sequence. In some embodiments the calibration sequence is attached to the beginning of each target nucleic acid to be sequenced. In some embodiments, the calibration sequence is not attached to the target sequence but is used to calibrate the sequencing instrument before acquiring the sequence of the target nucleic acid. In some embodiments, the calibration is used for more than one sequencing run. The calibration sequence is any length that provides adequate calibration. In some embodiments the calibration sequence is four bases long; in some embodiments the calibration sequence is 5, 6, 7, 8, 9, 10, 16, 20, 24, 28, 32, 64, or more bases long.
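One way such a calibration might be used in software is sketched below: the instrument reads a calibration sequence of known bases, averages the observed signal fraction for each base, and then assigns later observations to the nearest calibrated value. The function names, the example measurements, and the nearest-value rule are illustrative assumptions rather than the calibration procedure of any particular instrument.

```python
from statistics import mean


def calibrate(known_bases, measured_fractions):
    """Average the observed label-X signal fraction for each known base."""
    by_base = {}
    for base, fraction in zip(known_bases, measured_fractions):
        by_base.setdefault(base, []).append(fraction)
    return {base: mean(values) for base, values in by_base.items()}


def call_base(fraction, calibrated):
    """Assign an observed fraction to the base with the nearest calibrated value."""
    return min(calibrated, key=lambda base: abs(calibrated[base] - fraction))


# Hypothetical measurements for an eight-base calibration sequence.
calibrated = calibrate("ACGTACGT",
                       [0.52, 0.31, 0.69, 0.98, 0.48, 0.35, 0.65, 1.00])
print(call_base(0.55, calibrated))
print(call_base(0.90, calibrated))
```

Calibrating from observed values rather than nominal label ratios lets the same calling rule absorb instrument-specific effects such as filter bandwidth or detector gain, at the cost of requiring the known sequence to be read first.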

Some embodiments provide methods for the detection of molecules or the labeling of samples using detection reagents labeled with different ratios of at least two labels. Differences in signal ratios identify the molecules and differentiate the molecules from each other. For example, some methods comprise contacting a sample (e.g., a cell, tissue, fluid, etc.) with two or more antibodies wherein each antibody is labeled at a different ratio with at least two moieties; some methods comprise contacting a sample (e.g., a cell, tissue, fluid, etc.) with two or more labeled oligonucleotide probes wherein each probe is labeled at a different ratio with at least two moieties. The methods comprise differentiating two or more molecules, samples, tissues, cells, etc. from each other by associating a difference in signal ratio with a detection reagent and thus with the detection reagent target.

Compositions

The technology provides compositions, e.g., compositions of nucleotide bases alone or in combination (e.g., a mixture) wherein the labeling ratio differs for at least two of the bases. As noted above, the ratio of signals produced and detected during the SBS reaction varies proportionally with the label ratio for each base. For example, a 2:1 label ratio produces a signal at two wavelengths with a 2:1 ratio of signal intensities (e.g., peak heights).

In some embodiments, the label ratio differs among the four bases, allowing for differentiating each base from the three others, e.g., as each base is incorporated in an ensemble SBS reaction and a signal is produced. As used herein, the “label ratio” refers to the relative fractions of base molecules of one type that are labeled. The label ratio can be any ratio that allows a base to be distinguished from another base labeled at a second label ratio. For instance, if the number of individual A base molecules (e.g., in a solution) is 100 and the number of individual A base molecules that are labeled with label X is 50 and the number of individual A base molecules that are labeled with label Y is 50, then the label ratio for A is 1:1. In this exemplary embodiment, the label ratios for the other three bases C, G, and T, are 1:2, 2:1, and 1:0, respectively. Various embodiments provide label ratios other than these exemplary values. Indeed, any combination of label ratios is contemplated by the technology, provided that the four bases can be distinguished from one another based on the differences in the label ratios and the subsequent ratios of the signals produced in an SBS reaction. In various embodiments, any of the four bases is labeled at a ratio that is 1:0, 1:3, 1:2, 1:1, 2:1, 3:1, or 0:1 with two labels, provided the ratios are sufficient to differentiate the bases from one another.
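The exemplary label ratios above can be checked for distinguishability with a few lines of exact arithmetic. In this sketch each X:Y label ratio is converted to the expected fraction of signal from label X, and a set of ratios is accepted only if neighboring fractions differ by at least a chosen gap. The minimum gap of 1/10 and the function names are assumptions made for illustration; a real threshold would depend on instrument noise.

```python
from fractions import Fraction

# Exemplary X:Y label ratios from the text: A 1:1, C 1:2, G 2:1, T 1:0.
LABEL_RATIOS = {"A": (1, 1), "C": (1, 2), "G": (2, 1), "T": (1, 0)}


def expected_fraction_x(ratio):
    """Convert an X:Y label ratio to the expected share of signal from X."""
    x, y = ratio
    return Fraction(x, x + y)


def are_distinguishable(label_ratios, min_gap=Fraction(1, 10)):
    """Check that every pair of bases differs by at least min_gap (assumed)."""
    fractions = sorted(expected_fraction_x(r) for r in label_ratios.values())
    return all(b - a >= min_gap for a, b in zip(fractions, fractions[1:]))


for base, ratio in LABEL_RATIOS.items():
    print(base, ratio, expected_fraction_x(ratio))
print(are_distinguishable(LABEL_RATIOS))
```

Note that ratios such as 1:1 and 2:2 describe the same labeling and would fail this check, which matches the requirement that the ratios be sufficient to differentiate the bases.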

Compositions provided by the present technology include solutions of four bases wherein at least two bases are labeled with different ratios of at least two labels. Embodiments of compositions generally comprise a buffer known in the art and optionally comprise other salts and preservatives known to those in the art, e.g., to maintain the stability of the composition. Various embodiments include compositions comprising one base or mixtures of 2, 3, 4, or more bases. The bases in these compositions are labeled at different ratios with different labels using identification schemes as discussed above.

Some embodiments provide a composition comprising a calibration oligonucleotide comprising or consisting of a known sequence of bases. In some embodiments, the calibration oligonucleotide comprises or consists of 4, 5, 6, 7, 8 or more bases whose sequence is known. The oligonucleotide is, in some embodiments, synthesized chemically.

Data Analysis

Some embodiments comprise a computer system upon which embodiments of the present teachings may be implemented. In various embodiments, a computer system includes a bus or other communication mechanism for communicating information and a processor coupled with the bus for processing information. In various embodiments, the computer system includes a memory, which can be a random access memory (RAM) or other dynamic storage device, coupled to the bus for identifying bases (e.g., making “base calls”), and instructions to be executed by the processor. Memory also can be used for storing temporary variables or other intermediate information during execution of instructions to be executed by the processor. In various embodiments, the computer system can further include a read only memory (ROM) or other static storage device coupled to the bus for storing static information and instructions for the processor. A storage device, such as a magnetic disk or optical disk, can be provided and coupled to the bus for storing information and instructions.

In various embodiments, the computer system is coupled via the bus to a display, such as a cathode ray tube (CRT) or a liquid crystal display (LCD), for displaying information to a computer user. An input device, including alphanumeric and other keys, can be coupled to the bus for communicating information and command selections to the processor. Another type of user input device is a cursor control, such as a mouse, a trackball, or cursor direction keys for communicating direction information and command selections to the processor and for controlling cursor movement on the display. This input device typically has two degrees of freedom in two axes, a first axis (e.g., x) and a second axis (e.g., y), that allows the device to specify positions in a plane.

A computer system can perform embodiments of the present technology. Consistent with certain implementations of the present teachings, results can be provided by the computer system in response to the processor executing one or more sequences of one or more instructions contained in the memory. Such instructions can be read into the memory from another computer-readable medium, such as a storage device. Execution of the sequences of instructions contained in the memory can cause the processor to perform the methods described herein. Alternatively, hard-wired circuitry can be used in place of or in combination with software instructions to implement the present teachings. Thus implementations of the present teachings are not limited to any specific combination of hardware circuitry and software.

The term “computer-readable medium” as used herein refers to any media that participates in providing instructions to the processor for execution. Such a medium can take many forms, including but not limited to, non-volatile media, volatile media, and transmission media. Examples of non-volatile media can include, but are not limited to, optical or magnetic disks. Examples of volatile media can include, but are not limited to, dynamic and flash memory. Examples of transmission media can include, but are not limited to, coaxial cables, copper wire, and fiber optics, including the wires that comprise the bus.

Common forms of computer-readable media include, for example, a floppy disk, a flexible disk, hard disk, magnetic tape, or any other magnetic medium, a CD-ROM, any other optical medium, punch cards, paper tape, any other physical medium with patterns of holes, a RAM, PROM, and EPROM, a FLASH-EPROM, any other memory chip or cartridge, or any other tangible medium from which a computer can read.

Various forms of computer readable media can be involved in carrying one or more sequences of one or more instructions to the processor for execution. For example, the instructions can initially be carried on the magnetic disk of a remote computer. The remote computer can load the instructions into its dynamic memory and send the instructions over a network connection (e.g., a LAN, a WAN, the internet, a telephone line). A local computer system can receive the data and transmit it to the bus. The bus can carry the data to the memory, from which the processor retrieves and executes the instructions. The instructions received by the memory may optionally be stored on a storage device either before or after execution by the processor.

In accordance with various embodiments, instructions configured to be executed by a processor to perform a method are stored on a computer-readable medium. The computer-readable medium can be a device that stores digital information. For example, a computer-readable medium includes a compact disc read-only memory (CD-ROM) as is known in the art for storing software. The computer-readable medium is accessed by a processor suitable for executing instructions configured to be executed.

In accordance with such a computer system, some embodiments of the technology provided herein further comprise functionalities for collecting, storing, and/or analyzing data (e.g., nucleotide sequence data). For example, some embodiments contemplate a system that comprises a processor, a memory, and/or a database for, e.g., storing and executing instructions, analyzing imaging data from an SBS reaction, performing calculations using the data, transforming the data, and storing the data. In some embodiments, a base-calling algorithm assigns a sequence of bases to the data and associates quality scores to base calls based on a statistical model. In some embodiments, the system is configured to assemble a sequence from multiple sub-sequences, in some instances accounting for overlap and calculating a consensus sequence. In some embodiments, a sequence determined from an SBS reaction is aligned to a reference sequence or to a scaffold.
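Two of the analysis steps mentioned above, attaching a quality score to a base call and computing a consensus, can be sketched briefly. The text does not specify a quality scale or a consensus rule, so both choices here are assumptions: the widely used Phred scale, Q = -10 log10(p), stands in for the quality score, and a per-column majority vote over reads that are already aligned and of equal length stands in for the consensus. Function names are hypothetical.

```python
import math
from collections import Counter


def phred_quality(error_probability):
    """Convert an estimated error probability to a rounded Phred score."""
    return round(-10 * math.log10(error_probability))


def consensus(aligned_reads):
    """Majority vote per column over equal-length, pre-aligned reads."""
    return "".join(Counter(column).most_common(1)[0][0]
                   for column in zip(*aligned_reads))


print(phred_quality(0.001))                 # one expected error in 1000 calls
print(consensus(["ACGT", "ACGA", "ACGT"]))  # the third column outvotes one error
```

A production base caller would derive the error probability from its statistical model of the signal ratios and would weight the vote by quality; this sketch only shows the shape of the two computations.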

Many diagnostics involve determining the presence of, or a nucleotide sequence of, one or more nucleic acids. Thus, in some embodiments, an equation comprising variables representing the presence or sequence properties of multiple nucleic acids produces a value that finds use in making a diagnosis or assessing the presence or qualities of a nucleic acid. As such, in some embodiments this value is presented by a device, e.g., by an indicator related to the result (e.g., an LED, an icon on an LCD, a sound, or the like). In some embodiments, a device stores the value, transmits the value, or uses the value for additional calculations.

Moreover, in some embodiments a processor is configured to control the sequencing reactions and collect the data (e.g., images). In some embodiments, the processor is used to initiate and/or terminate each round of sequencing and data collection relating to a sequencing reaction. Some embodiments comprise a processor configured to analyze the dataset of signal ratios acquired during the SBS reaction and discern the sequence of the target nucleic acid and/or of its complement.

In some embodiments, a device that comprises a user interface (e.g., a keyboard, buttons, dials, switches, and the like) for receiving user input is used by the processor to direct a measurement. In some embodiments, the device further comprises a data output for transmitting (e.g., by a wired or wireless connection) data to an external destination, e.g., a computer, a display, a network, and/or an external storage medium.

In some embodiments, the technology finds use in assaying the presence of one or more nucleic acids and/or providing the sequence of one or more nucleic acids. Accordingly, the technology provided herein finds use in the medical, clinical, and emergency medical fields. In some embodiments a device is used to assay biological samples. In such an assay, the biological sample comprises a nucleic acid and sequencing the nucleic acid is indicative of a state or a property of the sample and, in some embodiments, the subject from which the sample was taken. Some relevant samples include, but are not limited to, whole blood, lymph, plasma, serum, saliva, urine, stool, perspiration, mucus, tears, cerebrospinal fluid, nasal secretion, cervical or vaginal secretion, semen, pleural fluid, amniotic fluid, peritoneal fluid, middle ear fluid, joint fluid, gastric aspirate, a tissue homogenate, a cell homogenate, or the like.

The sequence of output signals provides the sequence of the synthesized DNA and thus, by the rules of base complementarity, also provides the sequence of the template strand.
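The complementarity step can be written in a couple of lines. This sketch assumes both strands are written in the conventional 5′ to 3′ direction, so the template is the reverse complement of the synthesized strand; the function name is hypothetical.

```python
# Translation table mapping each base to its Watson-Crick complement.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def template_from_synthesized(synthesized):
    """Return the template strand (5' to 3') for a synthesized strand (5' to 3')."""
    return synthesized.translate(COMPLEMENT)[::-1]


print(template_from_synthesized("AACG"))
```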

Apparatuses

A further aspect of the invention provides an apparatus for carrying out the methods or for preparing the compositions of the technology. Such apparatus might comprise, for example, a plurality of nucleic acid templates and primers bound, preferably covalently, to a solid support, together with a nucleic acid polymerase, a plurality of nucleotide precursors such as those described above, which are labeled according to a label ratio, and a functionality for controlling temperature and/or nucleotide additions. Preferably the apparatus also comprises a detecting functionality for detecting and distinguishing signals from individual nucleic acid clusters. Such a detecting functionality might comprise a charge-coupled device operatively connected to a magnifying device such as a microscope. Preferably any apparatuses of the invention are provided in an automated form, e.g., under the control of a program of steps and decisions, e.g., as implemented in computer software.

Some embodiments of such an apparatus include a fluidic delivery and control unit; a sample processing unit; a signal detection unit; and a data acquisition, analysis, and control unit. Various embodiments of the apparatus can provide for automated sequencing that can be used to gather sequence information from a plurality of sequences in parallel, e.g., substantially simultaneously.

In various embodiments, the fluidics delivery and control unit includes a reagent delivery system. The reagent delivery system can include a reagent reservoir for the storage of various reagents (e.g., compositions of nucleotides according to the technology). The reagents can include RNA-based primers, forward/reverse DNA primers, oligonucleotide mixtures for ligation sequencing, nucleotide mixtures for sequencing-by-synthesis, buffers, wash reagents, blocking reagent, stripping reagents, and the like. Additionally, the reagent delivery system can include a pipetting system or a continuous flow system that connects the sample processing unit with the reagent reservoir.

In various embodiments, the sample processing unit can include a sample chamber, such as a flow cell, a substrate, a micro-array, a multi-well tray, or the like. The sample processing unit can include multiple lanes, multiple channels, multiple wells, or other modes of processing multiple sample sets substantially simultaneously. Additionally, the sample processing unit can include multiple sample chambers to enable processing of multiple runs simultaneously. In particular embodiments, the system can perform signal detection on one sample chamber while substantially simultaneously processing another sample chamber. Additionally, the sample processing unit can include an automation system for moving or manipulating the sample chamber.

In various embodiments, the signal detection unit can include an imaging or detection sensor. The signal detection unit can include an excitation system to cause a probe, such as a fluorescent dye, to emit a signal. The excitation system can include an illumination source, such as an arc lamp, a laser, a light emitting diode (LED), or the like. In particular embodiments, the signal detection unit can include optics for the transmission of light from an illumination source to the sample or from the sample to the imaging or detection sensor. Alternatively, the signal detection unit may not include an illumination source, such as, for example, when a signal is produced spontaneously as a result of a sequencing reaction. For example, a signal can be produced by the interaction of a released moiety, such as a released ion interacting with an ion sensitive layer, or a pyrophosphate reacting with an enzyme or other catalyst to produce a chemiluminescent signal.

In various embodiments, a data acquisition, analysis, and control unit can monitor various system parameters. The system parameters can include the temperature of various portions of the instrument, such as a sample processing unit or reagent reservoirs, volumes of various reagents, the status of various system subcomponents, such as a manipulator, a stepper motor, a pump, or the like, or any combination thereof.

It will be appreciated by one skilled in the art that various embodiments of such an instrument can be used to practice a variety of sequencing methods including ligation-based methods, sequencing by synthesis, single molecule methods, and other sequencing techniques. Ligation sequencing can include single ligation techniques, or change ligation techniques where multiple ligations are performed in sequence on a single primary. Sequencing by synthesis can include the incorporation of dye labeled nucleotides, chain termination, or the like. Single molecule techniques can include staggered sequencing, where the sequencing reactions are paused to determine the identity of the incorporated nucleotide.

In various embodiments, the sequencing instrument can determine the sequence of a nucleic acid, such as a polynucleotide or an oligonucleotide. The nucleic acid can include DNA or RNA, and can be single stranded, such as ssDNA and RNA, or double stranded, such as dsDNA or a RNA/cDNA pair. In various embodiments, the nucleic acid can include or be derived from a fragment library, a mate pair library, a ChIP fragment, or the like. In particular embodiments, the sequencing instrument can obtain the sequence information from a group of substantially identical nucleic acid molecules.

In various embodiments, the sequencing instrument can output nucleic acid sequencing read data in a variety of different output data file types/formats, including, but not limited to: *.fasta, *.csfasta, *seq.txt, *qseq.txt, *.fastq, *.sff, *prb.txt, *.sms, *srs and/or *.qv.
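As an illustration of one of the listed formats, the sketch below builds a single four-line FASTQ record from a read and its per-base quality scores. It assumes the common Phred+33 quality encoding, in which each score is stored as the ASCII character with code score + 33; the function name and the example read are hypothetical.

```python
def fastq_record(read_id, bases, qualities):
    """Format one read as a four-line FASTQ record (Phred+33 encoding assumed)."""
    if len(bases) != len(qualities):
        raise ValueError("each base needs exactly one quality score")
    quality_string = "".join(chr(q + 33) for q in qualities)
    return f"@{read_id}\n{bases}\n+\n{quality_string}\n"


print(fastq_record("cluster_1", "ACGT", [30, 30, 20, 40]), end="")
```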

Some embodiments comprise a system for reconstructing a nucleic acid sequence in accordance with the various embodiments provided herein. The system can include a nucleic acid sequencer, a sample sequence data storage, a reference sequence data storage, and an analytics computing device/server/node. In various embodiments, the analytics computing device/server/node can be a workstation, a mainframe computer, a personal computer, a mobile device, etc.

The nucleic acid sequencer can be configured to analyze (e.g., interrogate) a nucleic acid fragment (e.g., single fragment, mate-pair fragment, paired-end fragment, etc.) utilizing all appropriate varieties of techniques, platforms, or technologies to obtain nucleic acid sequence information, e.g., using an ensemble sequencing by synthesis. In various embodiments, the nucleic acid sequencer can be in communications with the sample sequence data storage either directly via a data cable (e.g., a serial cable, a direct cable connection, etc.) or bus linkage or, alternatively, through a network connection (e.g., Internet, LAN, WAN, VPN, etc.). In various embodiments, the network connection can be a “hardwired” physical connection. For example, the nucleic acid sequencer can be communicatively connected (via Category 5 (CAT5), fiber optic, or equivalent cabling) to a data server that can be communicatively connected (via CAT5, fiber optic, or equivalent cabling) through the Internet and to the sample sequence data storage. In various embodiments, the network connection can be a wireless network connection (e.g., Wi-Fi, WLAN, etc.), for example, utilizing an 802.11b/g or equivalent transmission format. In practice, the network connection utilized is dependent upon the particular requirements of the system. In various embodiments, the sample sequence data storage can be an integrated part of the nucleic acid sequencer.

In various embodiments, the sample sequence data storage can be any database storage device, system, or implementation (e.g., data storage partition, etc.) that is configured to organize and store nucleic acid sequence read data generated by the nucleic acid sequencer such that the data can be searched and retrieved manually (e.g., by a database administrator/client operator) or automatically by way of a computer program/application/software script. In various embodiments, the reference data storage can be any database device, storage system, or implementation (e.g., data storage partition, etc.) that is configured to organize and store reference sequences (e.g., whole/partial genome, whole/partial exome, etc.) such that the data can be searched and retrieved manually (e.g., by a database administrator/client operator) or automatically by way of a computer program/application/software script. In various embodiments, the sample nucleic acid sequencing read data can be stored on the sample sequence data storage and/or the reference data storage in a variety of different data file types/formats, including, but not limited to: *.fasta, *.csfasta, *seq.txt, *qseq.txt, *.fastq, *.sff, *prb.txt, *.sms, *srs and/or *.qv.

In various embodiments, the sample sequence data storage and the reference data storage are independent standalone devices/systems or implemented on different devices. In various embodiments, the sample sequence data storage and the reference data storage are implemented on the same device/system. In various embodiments, the sample sequence data storage and/or the reference data storage can be implemented on the analytics computing device/server/node.

The analytics computing device/server/node can be in communications with the sample sequence data storage and the reference data storage either directly via a data cable (e.g., serial cable, direct cable connection, etc.) or bus linkage or, alternatively, through a network connection (e.g., Internet, LAN, WAN, VPN, etc.). In various embodiments, the analytics computing device/server/node can host a reference mapping engine, a de novo mapping module, and/or a tertiary analysis engine. In various embodiments, the reference mapping engine can be configured to obtain sample nucleic acid sequence reads from the sample data storage and map them against one or more reference sequences obtained from the reference data storage to assemble the reads into a sequence that is similar but not necessarily identical to the reference sequence using all varieties of reference mapping/alignment techniques and methods. The reassembled sequence can then be further analyzed by one or more optional tertiary analysis engines to identify differences in the genetic makeup (genotype), gene expression, or epigenetic status of individuals that can result in large differences in physical characteristics (phenotype). For example, in various embodiments, the tertiary analysis engine can be configured to identify various genomic variants (in the assembled sequence) due to mutations, recombination/crossover, or genetic drift. Examples of types of genomic variants include, but are not limited to: single nucleotide polymorphisms (SNPs), copy number variations (CNVs), insertions/deletions (Indels), inversions, etc.
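By way of non-limiting illustration only, the following Python sketch shows one simple way a reference mapping step followed by a variant identification step could operate. It assumes a gap-free alignment scored by mismatch count, which is only one of many possible mapping techniques and handles neither indels nor other variant types. The names map_read and find_snps and the example sequences are hypothetical.

```python
def map_read(reference, read):
    """Return (offset, mismatches) for the gap-free placement of the read
    on the reference that has the fewest mismatches.

    Returns (None, None) if the read is longer than the reference.
    """
    best_offset, best_mismatches = None, None
    for offset in range(len(reference) - len(read) + 1):
        window = reference[offset:offset + len(read)]
        mismatches = sum(1 for r, b in zip(window, read) if r != b)
        if best_mismatches is None or mismatches < best_mismatches:
            best_offset, best_mismatches = offset, mismatches
    return best_offset, best_mismatches


def find_snps(reference, read, offset):
    """List (reference_position, reference_base, read_base) for each mismatch."""
    return [
        (offset + i, reference[offset + i], base)
        for i, base in enumerate(read)
        if reference[offset + i] != base
    ]


# Made-up sequences for illustration only.
REFERENCE = "ACGTTGCAAGCT"
READ = "TGCTAG"

offset, mismatches = map_read(REFERENCE, READ)
snps = find_snps(REFERENCE, READ, offset)
print(offset, mismatches, snps)
```

In this sketch the single mismatch between the mapped read and the reference is reported as a candidate single nucleotide polymorphism; a practical tertiary analysis engine would typically require support from multiple reads before calling a variant.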

The optional de novo mapping module can be configured to assemble sample nucleic acid sequence reads from the sample data storage into new and previously unknown sequences.

It should be understood, however, that the various engines and modules hosted on the analytics computing device/server/node can be combined or collapsed into a single engine or module, depending on the requirements of the particular application or system architecture. Moreover, in various embodiments, the analytics computing device/server/node can host additional engines or modules as needed by the particular application or system architecture.

In various embodiments, the mapping and/or tertiary analysis engines are configured to process the nucleic acid and/or reference sequence reads in signal ratio space. In various embodiments, the sample nucleic acid sequencing read and reference sequence data can be supplied to the analytics computing device/server/node in a variety of different input data file types/formats, including, but not limited to: *.fasta, *.csfasta, *seq.txt, *qseq.txt, *.fastq, *.sff, *prb.txt, *.sms, *srs and/or *.qv.
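By way of non-limiting illustration only, the following Python sketch shows how two-label signals might be converted to bases in signal ratio space. As an assumption made for this sketch, the ratio is expressed as the fraction of total signal contributed by the first label, which keeps the value bounded between 0 and 1. The per-base fractions in ASSUMED_FRACTIONS, the nearest-value decision rule, and all function names are hypothetical and are not taken from the description.

```python
# Hypothetical expected first-label fractions for each base (assumed values).
ASSUMED_FRACTIONS = {"A": 0.2, "C": 0.4, "G": 0.6, "T": 0.8}


def first_label_fraction(signal_1, signal_2):
    """Fraction of the total detected signal that comes from the first label."""
    total = signal_1 + signal_2
    if total <= 0:
        raise ValueError("no signal detected")
    return signal_1 / total


def call_base(signal_1, signal_2, fractions=ASSUMED_FRACTIONS):
    """Pick the base whose expected fraction is closest to the observed one."""
    observed = first_label_fraction(signal_1, signal_2)
    return min(fractions, key=lambda base: abs(fractions[base] - observed))


def call_read(signal_pairs):
    """Call one base per (first-label signal, second-label signal) pair."""
    return "".join(call_base(s1, s2) for s1, s2 in signal_pairs)


# Made-up signal intensities for four sequencing cycles.
read = call_read([(21, 79), (390, 610), (58, 42), (830, 170)])
print(read)
```

Because only the ratio of the two signals is used, the absolute intensity of each cycle (for example 100 versus 1000 total counts above) does not affect the call in this sketch.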

Uses

The technology provides the use of the methods of the technology, or the compositions of the technology, for sequencing and/or re-sequencing nucleic acid molecules for gene expression monitoring, genetic diversity profiling, diagnosis, screening, whole genome sequencing, whole genome polymorphism discovery and scoring, or any other application involving the analysis of nucleic acids where sequence or partial sequence information is relevant.

Kits

A yet further aspect of the invention provides a kit for use in sequencing, re-sequencing, gene expression monitoring, genetic diversity profiling, diagnosis, screening, whole genome sequencing, whole genome polymorphism discovery and scoring, or any other application involving the sequencing of nucleic acids. In some embodiments, kits comprise at least one plurality of a nucleotide labeled with at least two labels and, optionally, a calibration oligonucleotide comprising a known sequence. In some embodiments, a kit is provided for the detection of molecules using detection reagents labeled with different label ratios. Differences in signal ratio identify the molecules and differentiate the molecules from each other. For example, some kits comprise an antibody labeled with at least two labels at a defined ratio; some kits comprise an oligonucleotide probe labeled with at least two labels at a defined ratio.
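By way of non-limiting illustration only, the following Python sketch shows one way signals obtained from a calibration oligonucleotide of known sequence might be used to estimate the expected signal ratio for each base. As an assumption made for this sketch, the ratio is expressed as the first-label fraction of total signal and the estimate is a simple per-base mean. The function name calibrate and the example sequence and signal values are hypothetical.

```python
from collections import defaultdict


def calibrate(known_sequence, signal_pairs):
    """Estimate the mean first-label fraction observed for each base.

    known_sequence: the known calibration sequence, one base per cycle.
    signal_pairs: one (first-label signal, second-label signal) pair per cycle;
                  this sketch assumes every pair has a nonzero total signal.
    """
    if len(known_sequence) != len(signal_pairs):
        raise ValueError("one signal pair is required per known base")
    observed = defaultdict(list)
    for base, (signal_1, signal_2) in zip(known_sequence, signal_pairs):
        observed[base].append(signal_1 / (signal_1 + signal_2))
    return {base: sum(values) / len(values) for base, values in observed.items()}


# Made-up calibration run: known sequence "AACC" and invented signals.
calibration = calibrate("AACC", [(20, 80), (30, 70), (60, 40), (80, 20)])
print(calibration)
```

The resulting per-base values could then serve as the reference points against which signals from an unknown sample are compared.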

Moreover, processes and systems for sequencing that may be adapted for use with the technology are described in, for example, U.S. Pat. No. 7,405,281, entitled “Fluorescent nucleotide analogs and uses therefor”, issued Jul. 29, 2008 to Xu et al.; U.S. Pat. No. 7,315,019, entitled “Arrays of optical confinements and uses thereof”, issued Jan. 1, 2008 to Turner et al.; U.S. Pat. No. 7,313,308, entitled “Optical analysis of molecules”, issued Dec. 25, 2007 to Turner et al.; U.S. Pat. No. 7,302,146, entitled “Apparatus and method for analysis of molecules”, issued Nov. 27, 2007 to Turner et al.; and U.S. Pat. No. 7,170,050, entitled “Apparatus and methods for optical analysis of molecules”, issued Jan. 30, 2007 to Turner et al.; and U.S. Pat. Pub. Nos. 20080212960, entitled “Methods and systems for simultaneous real-time monitoring of optical signals from multiple sources”, filed Oct. 26, 2007 by Lundquist et al.; 20080206764, entitled “Flowcell system for single molecule detection”, filed Oct. 26, 2007 by Williams et al.; 20080199932, entitled “Active surface coupled polymerases”, filed Oct. 26, 2007 by Hanzel et al.; 20080199874, entitled “CONTROLLABLE STRAND SCISSION OF MINI CIRCLE DNA”, filed Feb. 11, 2008 by Otto et al.; 20080176769, entitled “Articles having localized molecules disposed thereon and methods of producing same”, filed Oct. 26, 2007 by Rank et al.; 20080176316, entitled “Mitigation of photodamage in analytical reactions”, filed Oct. 31, 2007 by Eid et al.; 20080176241, entitled “Mitigation of photodamage in analytical reactions”, filed Oct. 31, 2007 by Eid et al.; 20080165346, entitled “Methods and systems for simultaneous real-time monitoring of optical signals from multiple sources”, filed Oct. 26, 2007 by Lundquist et al.; 20080160531, entitled “Uniform surfaces for hybrid material substrates and methods for making and using same”, filed Oct. 31, 2007 by Korlach; 20080157005, entitled “Methods and systems for simultaneous real-time monitoring of optical signals from multiple sources”, filed Oct. 26, 2007 by Lundquist et al.; 20080153100, entitled “Articles having localized molecules disposed thereon and methods of producing same”, filed Oct. 31, 2007 by Rank et al.; 20080153095, entitled “CHARGE SWITCH NUCLEOTIDES”, filed Oct. 26, 2007 by Williams et al.; 20080152281, entitled “Substrates, systems and methods for analyzing materials”, filed Oct. 31, 2007 by Lundquist et al.; 20080152280, entitled “Substrates, systems and methods for analyzing materials”, filed Oct. 31, 2007 by Lundquist et al.; 20080145278, entitled “Uniform surfaces for hybrid material substrates and methods for making and using same”, filed Oct. 31, 2007 by Korlach; 20080128627, entitled “SUBSTRATES, SYSTEMS AND METHODS FOR ANALYZING MATERIALS”, filed Aug. 31, 2007 by Lundquist et al.; 20080108082, entitled “Polymerase enzymes and reagents for enhanced nucleic acid sequencing”, filed Oct. 22, 2007 by Rank et al.; 20080095488, entitled “SUBSTRATES FOR PERFORMING ANALYTICAL REACTIONS”, filed Jun. 11, 2007 by Foquet et al.; 20080080059, entitled “MODULAR OPTICAL COMPONENTS AND SYSTEMS INCORPORATING SAME”, filed Sep. 27, 2007 by Dixon et al.; 20080050747, entitled “Articles having localized molecules disposed thereon and methods of producing and using same”, filed Aug. 14, 2007 by Korlach et al.; 20080032301, entitled “Articles having localized molecules disposed thereon and methods of producing same”, filed Mar. 29, 2007 by Rank et al.; 20080030628, entitled “Methods and systems for simultaneous real-time monitoring of optical signals from multiple sources”, filed Feb. 9, 2007 by Lundquist et al.; 20080009007, entitled “CONTROLLED INITIATION OF PRIMER EXTENSION”, filed Jun. 15, 2007 by Lyle et al.; 20070238679, entitled “Articles having localized molecules disposed thereon and methods of producing same”, filed Mar. 30, 2006 by Rank et al.; 20070231804, entitled “Methods, systems and compositions for monitoring enzyme activity and applications thereof”, filed Mar. 31, 2006 by Korlach et al.; 20070206187, entitled “Methods and systems for simultaneous real-time monitoring of optical signals from multiple sources”, filed Feb. 9, 2007 by Lundquist et al.; 20070196846, entitled “Polymerases for nucleotide analogue incorporation”, filed Dec. 21, 2006 by Hanzel et al.; 20070188750, entitled “Methods and systems for simultaneous real-time monitoring of optical signals from multiple sources”, filed Jul. 7, 2006 by Lundquist et al.; 20070161017, entitled “MITIGATION OF PHOTODAMAGE IN ANALYTICAL REACTIONS”, filed Dec. 1, 2006 by Eid et al.; 20070141598, entitled “Nucleotide Compositions and Uses Thereof”, filed Nov. 3, 2006 by Turner et al.; 20070134128, entitled “Uniform surfaces for hybrid material substrate and methods for making and using same”, filed Nov. 27, 2006 by Korlach; 20070128133, entitled “Mitigation of photodamage in analytical reactions”, filed Dec. 2, 2005 by Eid et al.; 20070077564, entitled “Reactive surfaces, substrates and methods of producing same”, filed Sep. 30, 2005 by Roitman et al.; 20070072196, entitled “Fluorescent nucleotide analogs and uses therefore”, filed Sep. 29, 2005 by Xu et al.; and 20070036511, entitled “Methods and systems for monitoring multiple optical signals from a single source”, filed Aug. 11, 2005 by Lundquist et al.; and Korlach et al. (2008) “Selective aluminum passivation for targeted immobilization of single DNA polymerase molecules in zero-mode waveguide nanostructures” PNAS 105(4): 1176-81, all of which are herein incorporated by reference in their entireties.

Various modifications and variations of the described compositions, methods, and uses of the technology will be apparent to those skilled in the art without departing from the scope and spirit of the technology as described. Although the technology has been described in connection with specific exemplary embodiments, it should be understood that the invention as claimed should not be unduly limited to such specific embodiments. Indeed, various modifications of the described modes for carrying out the invention that are obvious to those skilled in related fields are intended to be within the scope of the following claims.

We claim:
 1. A composition comprising a plurality of a nucleotide base wherein a first portion of the plurality is labeled with a first label and a second portion of the plurality is labeled with a second label.
 2. The composition of claim 1 further comprising a second plurality of a second nucleotide base wherein a third portion of the second plurality is labeled with the first label and a fourth portion of the second plurality is labeled with the second label and a first ratio of the first portion relative to the second portion is different than a second ratio of the third portion relative to the fourth portion.
 3. The composition of claim 2 further comprising a third plurality of a third nucleotide base and a fourth plurality of a fourth nucleotide base, wherein a fifth portion of the third plurality is labeled with the first label and a sixth portion of the third plurality is labeled with the second label, and a seventh portion of the fourth plurality is labeled with the first label and an eighth portion of the fourth plurality is labeled with the second label and a third ratio of the fifth portion relative to the sixth portion is different than a fourth ratio of the seventh portion relative to the eighth portion and the third and the fourth ratios are both different than the first and the second ratios.
 4. The composition of claim 3 wherein the first nucleotide base is A, the second nucleotide base is C, the third nucleotide base is G, and the fourth nucleotide base is T.
 5. The composition of claim 1 wherein the label is a fluorescent moiety.
 6. The composition of claim 1 further comprising a target nucleic acid, a sequencing primer, and a polymerase.
 7. The composition of claim 1 further comprising a nucleic acid comprising the nucleotide base.
 8. A system for sequencing a nucleic acid, wherein the system comprises: a) a composition comprising a plurality of a nucleotide base wherein a first portion of the plurality is labeled with a first label and a second portion of the plurality is labeled with a second label; and b) a calibration oligonucleotide.
 9. The system of claim 8 further comprising a sequencing apparatus.
 10. The system of claim 8 further comprising a processor configured to associate a signal ratio with a nucleotide base.
 11. The system of claim 8 further comprising an output functionality to provide a nucleotide sequence of the nucleic acid.
 12. The system of claim 8 further comprising a second plurality of a second nucleotide base, a third plurality of a third nucleotide base, and a fourth plurality of a fourth nucleotide base, wherein a third portion of the second plurality is labeled with the first label and a fourth portion of the second plurality is labeled with the second label, a fifth portion of the third plurality is labeled with the first label and a sixth portion of the third plurality is labeled with the second label, and a seventh portion of the fourth plurality is labeled with the first label and an eighth portion of the fourth plurality is labeled with the second label and a first ratio of the first portion relative to the second portion, a second ratio of the third portion to the fourth portion, a third ratio of the fifth portion to the sixth portion, and a fourth ratio of the seventh portion to the eighth portion are different from one another.
 13. The system of claim 8 further comprising a functionality to detect the first label and the second label.
 14. The system of claim 12 further comprising a functionality to differentiate the nucleotide base, the second nucleotide base, the third nucleotide base, and the fourth nucleotide base from one another.
 15. A kit for sequencing a nucleic acid, wherein the kit comprises: a) a composition comprising a plurality of a nucleotide base wherein a first portion of the plurality is labeled with a first label and a second portion of the plurality is labeled with a second label; and b) a calibration oligonucleotide.
 16. The kit of claim 15 further comprising a second plurality of a second nucleotide base, a third plurality of a third nucleotide base, and a fourth plurality of a fourth nucleotide base, wherein a third portion of the second plurality is labeled with the first label and a fourth portion of the second plurality is labeled with the second label, a fifth portion of the third plurality is labeled with the first label and a sixth portion of the third plurality is labeled with the second label, and a seventh portion of the fourth plurality is labeled with the first label and an eighth portion of the fourth plurality is labeled with the second label and a first ratio of the first portion relative to the second portion, a second ratio of the third portion to the fourth portion, a third ratio of the fifth portion to the sixth portion, and a fourth ratio of the seventh portion to the eighth portion are different from one another.